Chapter 1


Data Merging Basics

Inner Join

It is necessary to understand that inner joins only return the rows with matching values in both tables.

new_df = df1.merge(df2, on='columns_name')

This is give us Inner joined table, combination of all the columns joined with respect to the column_name specified. DO NOTE, that columns that are common in both are tables are repeated with _x/_y attached at the end.

We can change this behavious of adding _x / _y at the end to something that we want by providing the values in suffixes argument.

suffixes=('_ward', '_cen')

One to Many Relationship

image.png

From our End there is no change in syntax. It all depends on the table we are merging with based on the common column.

Excersize

image-2.png

Merging Multiple DataFrames

image.png

image-2.png


Chapter 2

Merging Tables with Different Join Types


Left Join

image.png

how = "left" parameter

image-2.png

NOTE: A left join will return all of the rows from the left table. If those rows in the left table match multiple rows in the right table, then all of those rows will be returned. Therefore, the returned rows must be equal to if not greater than the left table. Knowing what to expect is useful in troubleshooting any suspicious merges.

Right Join

image.png

If the column names are different in both the tables, than we can explicitly specify that using left_on and right_on.

image-2.png

Outer Join

image.png

image-2.png

image-3.png

Merging a Table to Itself: (SELF JOIN)


When will you have to join a table to itself:

  1. Graph Data
  2. Hierarchial relationships
  3. Sequential relationships

This kind of Merger is By Default, an INNER JOIN

image.png

image.png

Merging on Indexes


Merging on Index is pretty much the same as merging on column When Index is common. Eg: image.png

Merging on Multi index with same Index name is also, pretty similar, like this: image.png

The differnt in Method arises when we are merging on differnt Indexes from both tables, For that we need to mention left_index=True and right_index = True to get specify the difference in Indexes. Whenever we are using left_on or right_on argument we need to set the respective left_index or right_index to True

image.png

USE CASE (Examples)

image.png

Chapter 3

Advanced Merging and Concatenation


Fist we gotta look at the difference between Mutating Versus filtering Join

Mutating Joins:

Filtering Joins:

But before this, let's see what is Semi-join ? So, In semi-join, the right table is filtered according to matching values from left table. Similar to INNER JOIN, but unlike that only columns of left table are shown. Like this:

image.png

image.png


Opposite to Semi-Join, Anti-Join excludes the intersection.

image.png

image.png

Concatenate DataFrames together vertically

.concat(), They are combined in the order the names they are passed in. The result is a vertically combined tables. If the index contains no valuable information than we can simply ignore the index by ignore_index=True.

image.png

NOTE: Providing keys and ignoring index can NOT be done simultaneously

image.png

sort = True argument is used to sort the column names in the combined table
join='inner' argument is used to select only the common columns between the tables, and sort doesn't work with it.

.append() :


Verifying Integrity

There are very possible chances to get unintended duplicates, one-to-many, or many-to-many relationships between concatenated tables. So we need to verify our Data integrity.

image.png

image.png

If the merge is not of specific type, then its gonna throw an ERROR

image.png

For concatenation, it checks only Index values for duplicates not column values, checks with verify_integrity = True

image.png


Chapter 4


merge_ordered()

This method can merge time-series or other ordered data. Give sorted result

image.png

Merging 2 tables and then Forward Filling the missing values:

image.png

When to Use:

Keep Note: When we are choosing the columns to merge on, it is important to chose the order of those columns wisely, as according to that the resultant values will change their order.


merge_asof()

Another method for ordered or time series data. It is similar to ordered left join,
However, the match doesn't need to be an exact match, nearby value is considered as well
NOTE: Merged 'on' columns must be sorted !!

DEFAULT VALUE ==> direction = 'backward', but we can change this to: direction = 'forward', this direction argument is used to determine which nearby value to pick, less than or greater than respectively. One more value that we can set to direction is direction = 'nearest', this will chose the nearest irrespective of forward or backward.

image.png

When to Use:

Using merge_asof() to create dataset

The merge_asof() function can be used to create datasets where you have a table of start and stop dates, and you want to use them to create a flag in another table.

Example:

image.png

image.png


.query()

image.png

image.png

image.png

We can use query() method to select strings

image.png


.melt()

Useful to reshaping our table in more computer friendly format.

Wide data is easier to read by people, while long data is often more accessible to computer to work with.

image.png

image.png

image.png

id_vars: this argument stands for identifier variables, these are those variables that remains intact in the wide format i.e. they are not unfolded/ unpivoted.

image.png

value_vars: this argument selects which columns are to be unpivoted into the resultant data, the columns not selected will the removed from the data. Also the order at which the columns names are mentioned is also considered while unpivoting the data.

image.png

var_name and value_name: these arguments are bydefault set to variable and value respectively. But through these arguments we can simply change the column names in unpivoted resultant data.

image.png